Data augmentation using Variational Autoencoders for improvement of respiratory disease classification

Computerized auscultation of lung sounds is gaining importance today with the availability of lung sounds and its potential in overcoming the limitations of traditional diagnosis methods for respiratory diseases. The publicly available ICBHI respiratory sounds database is severely imbalanced, making it difficult for a deep learning model to generalize and provide reliable results. This work aims to synthesize respiratory sounds of various categories using variants of Variational Autoencoders like Multilayer Perceptron VAE (MLP-VAE), Convolutional VAE (CVAE) Conditional VAE and compare the influence of augmenting the imbalanced dataset on the performance of various lung sound classification models. We evaluated the quality of the synthetic respiratory sounds’ quality using metrics such as Fréchet Audio Distance (FAD), Cross-Correlation and Mel Cepstral Distortion. Our results showed that MLP-VAE achieved an average FAD of 12.42 over all classes, whereas Convolutional VAE and Conditional CVAE achieved an average FAD of 11.58 and 11.64 for all classes, respectively. A significant improvement in the classification performance metrics was observed upon augmenting the imbalanced dataset for certain minority classes and marginal improvement for the other classes. Hence, our work shows that deep learning-based lung sound classification models are not only a promising solution over traditional methods but can also achieve a significant performance boost upon augmenting an imbalanced training set.


Introduction
The inception of deep learning has paved the way for many breakthroughs in science, medicine, and engineering. Deep learning is rapidly developing in the field of acoustics, providing many compelling solutions to problems like Automatic Speech Recognition [1], sound synthesis [2], acoustic scene classification [3], generative music [4], acoustic event detection [5], and many more. The increase in access to data generated by health care and the development of deep learning techniques has proved to be prosperous. Deep Learning can be used to unravel the clinically pertinent information from the dataset provided by hospitals, which can be then used for decision making, treatment, and prevention of health conditions. For efficacious training of neural networks, sufficient data is required. In the case of a low data scheme, the parameters are underdetermined, and the networks generalize poorly. Data Augmentation techniques enhance the generalizability of the deep learning model by using the original data efficiently. Standard data augmentation techniques like random translations, flips, and rotations, and the addition of Gaussian noise generates authentic but very little data.
Variational Autoencoders have been used to generate synthetic samples and improve the performance of the classifiers [6,7]. The creation of synthetic data for minority classes can be useful for many reasons. The first reason for creating synthetic samples is to avoid the usage of original samples for privacy reasons. It is highly possible that working directly on the database containing the original samples can be misused or breached. The second reason is oversampling of the minority classes. In real-life scenarios, there is always a possibility of certain classes being underrepresented.
Respiratory diseases are a pathological condition that affects the tissues and organs which consequently makes gas exchange difficult. Some of the factors contributing to respiratory diseases are air pollution, tobacco smoke, occupational chemicals, dust, etc [8]. Respiratory diseases are one of the leading causes of death and disability in the world. Improving respiratory health not only entails reducing tobacco consumption or controlling the unhealthy air at the workplace but also involves strengthening the health care system and detection of respiratory disease via breathing sounds is a compelling addition.

Literature review on automated respiratory sounds auscultation algorithms
The development of automated respiratory sound auscultation algorithms requires the procurement of respiratory sounds either through various devices or publicly accessible databases, annotating these audios, pre-processing, followed by feature extraction and classification of these audios.
The acquisition of respiratory sounds from patients at clinics is achieved using various equipment like microphones or stethoscopes. The development of mobile technology has made it possible to acquire clinical samples and diagnoses. [9] proposed a modified stethoscope to record lung sounds. Further, the introduction of mHealth applications has made it possible to monitor vital organs, track treatment progress and send patient records to hospitals. Few readily available respiratory sounds databases include the Multimedia Database MARS (Marburg Respiratory Sounds), RALE Repository (Respiratory Acoustic Laboratory Environment), ICBHI (International Conference on Biomedical and Health Informatics) 2017 Scientific Challenge Dataset, etc. The Marburg Respiratory Sounds Database has over 5000 audio recordings from patients with COPD, bronchial Asthma, lung fibrosis and Pneumonia, along with clinical parameters such as lung function, X-ray findings and laboratory results. The RALE Repository provides a collection of sounds recorded by a stethoscope from the chest of patients. It was created for medical students to gain experience in lung sounds auscultation. The ICBHI Respiratory Sounds Dataset contains 5.5 hours of respiratory sounds obtained from 128 patients belonging to 8 different classes (1 healthy + 7 diseases).
The development of computerized lung sound auscultation algorithms requires annotated audio data. These annotations are valuable in pre-processing steps such as the segmentation of lung sounds and for the development of supervised learning algorithms. Respiratory sound annotation software makes it easier for medical experts to annotate lung sounds by providing an intuitive interface for quickly identifying respiratory cycles and valuable sounds in the waveform such as wheezes and crackles. The development of Graphical User Interface (GUI) based applications such as LungSounds@UA has made it easier for users to interact with multimedia databases that store respiratory sounds and associated clinical parameters.
Pre-processing the acquired audio data is an essential step before feeding the data to a classification model. This is because the pre-processing techniques applied to the data affect the training efficiency and performance of a machine/deep learning model. Neural networks can represent any function. However, it may not effectively learn any input unless the correct preprocessing techniques have been used. Some of the pre-processing methods used for lung sounds include denoising, segmentation, normalization, resampling, spectrogram conversion, padding, trimming, etc. Lung sounds acquired from various devices are often corrupted by different noise sources such as heartbeats, speech, electrical interference, etc. [10] discusses denoising filters such as Butterworth Bandpass filters, wavelet transform based filters, FIR filters, etc. and compares their efficacies in eliminating certain types of noise from lung sounds. [11] utilized a local polynomial regression smoother for the removal of displacement artefacts in lung sounds. Resampling refers to modifying the sampling rate of the audio signals. Downsampling audio signals reduce the computational complexity and inference time for classification models without compromising their performance significantly. Segmentation algorithms identify the beginning and end of a respiratory cycle in the time domain. Spectrograms are two-dimensional audio representations helpful in extracting features using deep learning techniques such as CNN and its variants. Frequency ranges in which the audio signals have zero energy can be cropped from the spectrogram, allowing neural networks to focus on regions of interest thereby, leading to improved performance. Lastly, most neural networks often expect fixed length inputs and padding techniques such as zero-padding, duplicating respiration cycles are used to achieve a fixed-length input. Trimming eliminates any silence at the beginning or end of the audio.
Training machine and deep learning models on audio data requires a feature extraction step. This step transforms the raw audio into valuable features for each class that can be useful to the model in distinguishing between various classes. Audio features can be extracted from multiple domains such as time, frequency, time-frequency, entropy, spectral, statistical, wavelet, etc. [12] proposed a five-dimensional feature vector consisting of four time-domain features derived from Simple Moving Averages of a 92 ms window and one frequency domain feature (spectrum mean) to classify crackles. [13] proposed a feature vector consisting of averaged power-spectrums of 32 segments of a signal for classifying ten different samples of respiratory sounds. [14] decomposed the lung sounds into frequency sub-bands using wavelet transformed and extracted statistical features associated with each sub-band to represent the wavelet distribution. [15] used Mel Frequency Cepstral Coefficients that describes a signal's cepstrum for classifying lung sounds with Artificial Neural Networks. [16] constructed Autoregressive models using Linear Predictive Coding, a time-domain estimator of the signal based on a linear combination of previous samples weighted by LPC coefficients, for automated diagnosis of lung sounds. [17] used features such as percentile frequency ratios, mean crossing irregularity, kurtosis, Renyi entropy, etc. for the classification of wheeze and non-wheeze sounds. [18] proposed entropy-based features for the detection of four types of lung sounds and found the method to be robust against additive white Gaussian noise. [19] computed the "Wheeze Power Ratio", a ratio of the power of the maximum peak in the frequency range of 250-800 Hz to that of the mean power in the frequency range between 60-900 Hz and found it to be effective and robust in classifying wheeze sounds. [20] used audio spectral envelope and tonality index as discriminative features for wheezy sounds and found the method to be computationally simple, effective in detecting weak wheezes and robust against noise. The technique was suitable for automated auscultation of Asthma sounds acquired remotely from mobile devices where the wheezes may have weak intensity. Hence, deep-learning-based and manual feature extraction techniques have been proposed for classifying respiratory sounds.  Table 1 collectively summarizes the commonly used pre-processing, feature extraction techniques, and classification models applied to lung sound signals. Table 1 categorizes the research works depending on the framework used i.e. either anomaly-driven (classifying respiratory cycles as crackles, wheezes, both or none) or pathology-driven classification (detection of specific diseases).

Literature review on audio data augmentation techniques
Data Augmentation is a technique that attempts to create new samples from existing samples with features that resemble the original samples in the training set. The generated samples vary slightly from the original training data and this helps a classifier to generalize better for certain classes. Data Augmentation techniques can be categorized into two types for audio data: the first is standard data augmentation techniques like Time stretching, pitch shifting, addition of white noise [28], signal speeding, frequency masking, time masking [29], etc and the second technique involves generating samples from generative models like Variational Autoencoders [30] and Generative Adversarial Network [31]. Table 2 summarizes the recent works undertaken towards augmenting audio datasets to enhance audio classification results.

Exploratory data analysis of ICBHI dataset
This section explores the various aspects of the respiratory diseases database provided by the International Conference on Biomedical Health Informatics. The database consists of 5.5 hours of annotated recordings with respiratory cycles containing crackles, wheezes or both [38]. Crackles are discontinuous sounds representing the presence of the fluid or secretions, whereas wheezing refers to high pitched whistling sound that occurs when a breath is taken [39,40]. Wheezing is generally found in patients with Asthma and can be heard very loudly. On the other hand, crackles are heard through a stethoscope and are generally a sign of fluid in the lungs. Early expiratory crackles occur in people with severe airway obstruction. These types of crackles mainly occur in COPD and Asthma. Crackles are further categorized as fine and coarse. Fine crackles occur in pulmonary edema, Pneumonia, and fibrosis cases, whereas coarse crackles occur in cases of Bronchiectasis [39]. Fig 2 shows the distribution of crackles and wheezes in respiratory cycles of various diseases.
The pie chart in Fig 3 shows the distribution of the patients with different respiratory diagnoses. The total number of patients who participated in the study was 126 from various age categories and gender.
The study produced 920 annotated audio samples belonging to various classes as shown in Each audio file has more than one respiratory cycle of various lengths, which could be determined by the annotations provided in the dataset. The number of respiratory cycles for various diagnoses is shown in Fig 5. The total number of respiratory cycles was 6898, of which 5641 cycles belonged to the COPD class. Fig 5 shows that the dataset is highly imbalanced. The volume of data for certain classes is negligible; hence there is a need to use data augmentation methods to ensure that the deep learning classifiers perform well.  Fig 6 shows the split of the real audio segments into train and test sets. The stacked bar chart reveals that nearly 70% of the audio segments in each class were used to train the VAEs and classification models. The remaining segments were reserved for testing/performance evaluation of the classifiers.

Research gaps
1. Existing research works [21][22][23][24][25] have shown that machine and deep learning techniques hold promise in automating respiratory sounds auscultation. However, developing a machine/deep learning model with good generalization ability, especially towards the minority/disease classes, requires a large volume of data to learn the desired characteristics of these classes. The total number of respiratory cycles for certain classes in the ICBHI dataset is negligible, making it impossible to train data-hungry deep learning models. Hence, a need for techniques that enable training deep neural networks with limited data has arisen. Data Augmentation is one such technique that can help deep learning models combat overfitting in scenarios where certain classes are highly under-represented. We observed that very few works in the literature have experimented with the impact of data augmentation on the performance of respiratory sound classification models. Hence, this work has explored the efficacy of data augmentation using VAEs in enhancing lung sound classification performance.
2. Table 1 reveals that most works [21-24, 26, 27] on respiratory sounds classification have adopted an anomaly-detection based framework. Very few studies have focused on

PLOS ONE
Data augmentation using Variational Autoencoders for improvement of respiratory disease classification classifying multiple respiratory diseases, probably due to insufficient labelled audio data. In this study, we aimed at achieving an acceptable classification performance metric for six disease classes present in the ICBHI dataset. Neural networks have the potential to learn sophisticated and robust feature representations of the disease classes. However, the small number of audio samples in the ICBHI dataset limits the creation of multi-class classification models. Hence, to leverage the power of deep learning for diagnosing multiple respiratory diseases, we need a large number of labelled samples for the disease classes. Creating a comprehensive database with a large volume of data for the minority classes is time-consuming, costly and patients belonging to these classes might not be available. Hence, generative modelling techniques for synthesizing samples of the minority/disease classes are required.
3. Standard audio augmentation techniques such as time-shifting, pitch shifting, noise addition, mixup, speed scaling, frequency masking, etc. can be used to create limited audio samples for the minority classes, both in quantity and variety, which may not be sufficient to significantly enhance the respiratory sound classification performance metrics or generalization ability. On the other hand, generative models such as VAEs and GANs that are less explored for audio synthesis in the healthcare domain can create a much wider set of augmentations that can lead to more robust classifiers. Lastly, we noticed that GANs were more prevalent in literature for audio generation than VAEs in various domains. However, training GANs is challenging due to instabilities in the training process and mode collapse [41]. Very few works [30] have reported generating biomedical signals such as respiratory sounds with VAEs. Moreover, we addressed three major issues that we noticed in [30]: 1. Inclusion of test set data for training the generative model (data leakage), 2. Lack of quantitative and qualitative assessment for the quality and variety of the synthetic samples and 3.
Performance evaluation of the classifiers on the synthetic samples. These issues led to optimistic performance metrics on the test set, which could be misleading. Hence, our work provides an accurate insight into the potential of VAEs for augmentation and their role in enhancing classification performance metrics for respiratory sounds.

Methodology
Section 3 highlights the severity of the imbalance in the audio dataset. To address the issue, Variational Autoencoders were proposed to generate synthetic samples. Our work consists of two major steps: 1. generating synthetic samples for various respiratory classes and 2. Classifying the samples using deep learning models. As shown in Fig 7, after the audio acquisition, data pre-processing involves segmentation of the audio samples and padding the audio signals to make them of the same duration. We aim to create a comparison in the performance of the classifiers before and after augmentation. For the classification of respiratory diseases, Mel Frequency Cepstrum Coefficients are obtained in feature extraction steps which are then passed to the proposed classification models. We propose three different variational autoencoders, namely Multilayer Perceptron VAE, Convolutional VAE and Conditional VAE. Unlike Generative Adversarial Networks, Variational Autoencoders do not take raw audio samples; hence the feature extraction step is necessary. Mel Spectrograms are considered for this step instead of Mel Frequency Cepstrum Coefficients because the former can be easily converted into audio samples.

Audio acquisition
The dataset consists of audio samples which were obtained by two research teams over a different span and geographical location [38]. As indicated in Fig 2, the recordings can contain 6.2.2 Padding. As shown in Fig 8, the histogram peaks at 2.5 seconds and the majority of the respiratory cycles i.e. 6779 respiratory cycles lie before 6 seconds duration. We discarded those respiratory cycles which were above the length of 6 seconds and padded the respiratory cycles less than 6 seconds with silence in the end to ensure that all the segments were of the same length. The padded audio waveforms for different respiratory classes are shown in Fig 9.

Data augmentation
Variational Autoencoders belong to the family of Generative Models. Variational Autoencoders bear a huge similarity to Autoencoders in terms of their design. They consist of an encoder (a recognition or inference model) and a decoder, also known as a generative model. Another similarity between the two is they attempt to reconstruct the input data while learning from the latent vectors. However, the difference between the two is that the latent space of VAE is continuous. This is achieved as the encoder does not output an encoding vector of The purpose of the mean vector is to control the center location of the encoded input whereas the purpose of the standard deviation vector is to control the area in which the encoding can vary. Since the encoding is generated randomly in the circle, the decoder also exposes itself to variations of encoding.
Since the proposed Variational Autoencoders do not take raw audio as input, it is essential to carry out a feature extraction step. In this paper, we use Mel Spectrogram as a feature extraction step before the data augmentation process takes place. Mel spectrogram is a spectrogram where the frequencies are converted into the Mel scale. Let be the training data where each x i 2 R d represents a mel spectrogram belonging to a given respiratory class. The encoder in the VAE learns a mapping Q θ (z|X) from an input x i to the mean μ(x i ) and covariance σ 2 (x i ) vectors of the latent variables. VAEs assume  that the latent variable z follows a standard normal distribution N(0, 1). The decoder needs to sample from the latent distribution outputted by the encoder, since the encoder no longer outputs a latent representation z as in autoencoders. The decoder of a VAE learns a mapping P ϕ (X|z) from the latent representation z to the distribution parameters of the training data X. For the decoder to generate realistic samples, it needs to maximize the log likelihood of the input data i.e. log (p(X)). We also need to ensure that the VAE does not memorize/overfit the training data. This is done by adding a regularization term to the loss function of the VAE that constrains the Gaussian distribution outputted by the encoder to be close to a standard normal distribution. A standard normal distribution has an identity covariance matrix which implies that the covariance between the latent variables is zero and they are independent of each other. Hence, a VAE which learns to model the training data distribution p(X) with an encoder Q θ (z| X), a decoder P ϕ (X|z) and a latent distribution p(z) is trained using the objective function (1) where θ, ϕ are the parameters of the encoder and decoder respectively and D KL represents the Kullback-Leibler Divergence between two probability distributions.
The following steps obtain the Mel spectrograms, which are fed to the variational autoencoder: 1. Sampling the input with the window size of n_fft as 2048 while making hops of 512 to sample the next window. We propose three variants of Variational Autoencoders for tackling data imbalance. These are Multilayer Perceptron VAE (MLP-VAE), Convolutional VAE (CVAE) and Conditional CVAE. The architectural details, along with hyperparameter settings, are discussed in the following subsections. All the VAE models were trained on mel spectrograms of the minority classes. The encoders of the VAEs aimed at learning a mapping from the high-dimensional mel spectrograms to low-dimensional latent representations, which in turn can be used by the decoder to reconstruct a realistic mel spectrogram for a given minority class. As highlighted in Eq (1), the decoder learns to generate mel spectrograms similar to the real spectrograms by minimizing the reconstruction loss (Mean Squareddd Error in our case). On the other hand, the encoder is incentivized to output normally distributed latent variables by the regularization term (KL divergence between the distribution of latent variables and a standard normal distribution in our case). To control the outputs generated by the unconditional decoders, we instantiated a new VAE model per minority class. Each instance of the unconditional models was trained on only one of the minority classes, thus ensuring that all synthetic mel spectrograms generated by an unconditional model belong to the same class only. On the other hand, the Conditional VAE model was trained on all six minority classes simultaneously. Upon completing training for an instance of a VAE model, we generated synthetic mel spectrograms using the decoder network by inputting a latent vector z sampled from a standard normal distribution and a class label for the Conditional VAE.  Fig 12. 6.3.2 Convolutional VAE. The encoder is made up of two convolutional layers and three intermediate dense layers to generate the latent code. The kernel size used in the convolutional layers is three, numbers of strides were assigned as two and filters were varied depending on the number of convolutional layers i.e. there were 16 for the first layer and 32 for the second layer. The decoder has one dense layer, a reshape layer which is then given as an input to the three Conv2DTranspose layers responsible for upsampling and reconstructing the image to its original dimension. The filters in Conv2DTranspose layers vary with the layers and are assigned 32, 16 and 1, respectively. The strides and kernel size for the Conv2DTranspose were assigned 2 and 3, respectively. The activation function assigned to the first two layers was Rectified Linear Activation function or ReLU and sigmoid activation function for the last upsampling layer. The optimizer used in this case was RMS Prop. The network configuration of the VAE is indicated in Fig 13. 6.3.3 Conditional VAE. If the latent space is randomly sampled, it becomes impossible for the Variational Autoencoder to control the class of the sample being generated.

PLOS ONE
Data augmentation using Variational Autoencoders for improvement of respiratory disease classification Conditional VAE addresses the issue by including a one hot label or a condition into the encoder and the decoder. The reconstruction loss of the decoder and KL loss of the encoder is determined by latent vector and the condition. Conditional VAE is a special type of β-VAE (Disentangled VAE) where β = 1. The network structure of the VAE consists of an encoder which has a one hot vector of the labels provided as the input along with 3 Dense layers, 2 Convolutional layers with varying filters i.e. 16 and 32, kernel size as 3 and latent dimension as 2. The structure of the decoder consists of one hot vector of the class labels and 2 dimension input obtained from the lambda function as the input. It consists of 1 Dense layer and 3 Con-v2DTranspose Layers with filters 32, 16 and 1 respectively. The optimizer used in this case was RMS Prop. The network configuration of the VAE is indicated in Fig 14. Variational Autoencoder generates reconstructed Mel spectrograms which are converted into audio by applying Griffin-Lim [42]. The numbers of audio samples generated for the minority classes by various variational autoencoders are shown in Table 3.

Feature extraction for classification models
Before feeding the data into neural networks for the classification of respiratory diseases, it is essential to extract the required features from the data as it reduces the computation time of the classification models. We propose the use of Mel Frequency Cepstrum for the extraction of features from audio. This technique involves windowing the signal, applying direct Fourier transformation, taking log of magnitude and warping the frequencies on the Mel scale, and finally applying inverse Direct Cosine Transformation. Fig 15 shows the steps of computing the Mel Frequency Cepstral Coefficients for an audio signal.
We gave the value of the number of MFCC as 13 and sample rate of each audio file as 22050. Fig 16 visualizes the MFCCs over time for various respiratory classes in the dataset.

Classification models
In this paper, we have proposed five deep learning architectures for the classification of various respiratory diseases. The architectural details and hyperparameter settings of these models are discussed in the following subsections and Table 4, respectively. The last layer of these models is a dense layer with seven neurons and the softmax activation function that outputs the class probabilities corresponding to the seven respiratory classes. The classification models were compiled with categorical cross-entropy loss function, Adam optimizer and a learning rate of 0.0001. The classification models were trained on the imbalanced and augmented training sets and evaluated using metrics such as specificity, sensitivity, precision and F1-score.

PLOS ONE
Data augmentation using Variational Autoencoders for improvement of respiratory disease classification 6.5.1 Multilayer perceptron. As shown in Fig 17, the model takes flattened MFFC as input which is then fed to the seven dense layers which have dropout in the range of 10% to 40% in order to reduce overfitting.

PLOS ONE
Data augmentation using Variational Autoencoders for improvement of respiratory disease classification 6.5.2 Convolutional neural network. The second model used for detection of respiratory diseases is CNN which has a sequential framework. The model has two Conv2D layers which were used for feature extraction, with five intermediate dense layers and a dense layer as the output layer. Convolutional Neural Networks have the ability to capture the spatial and temporal dependencies from the input, specifically an image through the application of various filters. It is preferred because of the reduction in the number of parameters involved and option of reusability of the weights. In short, Convolutional Neural networks reduce the mel-spectrograms, into a form that is easier to process and does not lose the critical features on its way to achieve it. Fig 18 shows the architecture of the proposed CNN model for classifying MFCCs of respiratory sounds.

Recurrent neural network: LSTM.
A recurrent neural network is a class of neural networks where the previous outputs are used as input while having hidden states. The advantage of recurrent neural networks is that the weights are shared across time. On the other hand, vanishing gradients are one of the major problems in RNN. Hence, LSTM was introduced to counter the issue. LSTM has the edge over feed-forward and recurrent neural networks because of their property of selectively memorizing patterns for long durations of time. Fig 19 shows the architecture of the proposed LSTM model for classifying the MFCCs of respiratory sounds over several time frames.

Resnet-50.
The proposed respiratory sound classification method involves a pretrained Resnet-50 model [43], which is considered for both deep feature extraction and finetuning. The input to the model is the MFCC spectrogram of dimension (13,130,1). The fully connected layers were used for predicting the class labels. The output obtained from Resnet is flattened, normalized using BatchNormalization and passed into fully connected layers. The fully connected layer consists of 6 hidden dense layers and one output layer with a softmax activation function. The dense layers have dropouts in the range of 10% to 30% and a batch normalization layer to prevent overfitting. Fig 20 shows the architecture of the proposed transfer learning model that employs the RESNET-50 backbone for extracting features from the MFCCs of respiratory sounds and classifies the same.

Results
In this section, we evaluate the performances of the generative and classification models. To assess the ability of the generative models to synthesize realistic respiratory sounds, we used metrics such as Fréchet Audio Distance (FAD), Cross-Correlation, Mel Cepstral Distortion and compared the features of the synthetic and original audios using Principal Component Analysis (PCA). To evaluate the performance of the classifiers and the effectiveness of the addition of the synthetic samples to the imbalanced training set in combating overfitting for the minority classes, we measured metrics such as confusion matrices, specificity, sensitivity, precision and F1 score on the hold-out test set.

Generative models
7.1.1 Fréchet audio distance. The Fréchet Audio Distance (FAD) is a reference-free evaluation metric used for assessing the quality of synthetic samples created by a generative model [45]. Reference samples for each synthetic audio generated are not required for measuring the FAD metric. Other benefits of using the FAD metric include robustness against noise, computational efficiency, correlates well with the human judgment of audio quality (Pearson Correlation: 0.52), sensitivity to intra-class mode dropping [46]. The FAD metric compares the statistics of embeddings obtained from a VGGish audio classification model for the original and synthetic datasets using Eq 2.
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi P where μ r : Mean of real data distribution μ g : Mean of generated data distribution ∑ r : Covariance of real data distribution ∑ g : Covariance of generated data distribution Fig 22 illustrates the computation of FAD. The synthetic and original audios for each lung sound class are inputs to the VGGish model to obtain the corresponding embeddings. The mean and covariance of these embeddings are obtained to compute the FAD metric. Lower the FAD, the smaller the distance between the distributions of the original and synthetic audios. A small FAD is also indicative of the good audio quality of the synthetic samples. Table 5 and the bar graph in Fig 23 show the FAD scores of the synthetic audios generated by the proposed VAEs w.r.t the original sounds for each class. The FAD scores of synthetic audios created by CNN-VAE and Conditional VAE exhibit consistent values in the range of 10-14, whereas those of MLP-VAE vary greatly depending on the lung sound class. The MLP-VAE synthesized audios of the 'Healthy' and 'LRTI' classes with low FAD scores of 4.81 and 3.16, respectively. The synthetic audios are obtained by inverting the Mel spectrograms reconstructed by the decoder of the VAEs. Mel Spectrograms are a lossy compression of STFT as the phase information gets discarded. Hence, to estimate the phase during inversion, we used the iterative Griffin and Lim method [42] offered by the audio processing library Librosa. Hence, distortions introduced by the inversion method may lead to a higher FAD score.

Visualization of audio features using principal component analysis.
The features of the real and synthetic respiratory segments were visualized in two-dimensional space using

Cross-correlation between synthetic and real audios.
Cross-correlation is a measure of how much two signals resemble each other [49]. The higher the correlation, the better is the similarity between the two signals. For 1D time-series x and y, the cross-correlation at lag k, denoted by ϕ xy [k], is given by the Eq (3). From Eq (3), we observe that the cross-correlation operation multiplies a signal y[n] with a shifted version of the signal x[n] by introducing a lag k. This study used Scipy's correlate method to measure the cross-correlation between the real and synthetic audio segments. Since the real and synthetic signals were of the same length and the 'valid' mode argument of Scipy's correlate [50] method was used, we essentially measured the correlation between the real and synthetic signals at lag 0. To estimate the cross-correlation of the synthetic signals w.r.t. the real signals, we took a sample size of 50 from each minority class. The cross-correlation of each of the sampled synthetic signals w.r.t. the sampled real signals were measured, and the maximum correlation for each synthetic sample was selected. The mean and standard deviation of the maximum attainable correlation for each of the sampled synthetic samples w.r.t. the real samples are reported in Table 6. The correlation heatmaps in Figs 27-29 visualize the mean maximum attainable cross-correlation between the sampled synthetic samples of a given class and all minority classes.
The correlation heatmap in Fig 27 shows that the synthetic audio segments created by MLP-VAE are very weakly correlated with the real audio segments for all the classes except LRTI. The mean cross-correlation between the sampled synthetic LRTI signals created by MLP-VAE and the real LRTI signals is 0.79, the highest correlation compared to the The figures show that the synthetic audio segments of a given class show the highest mean correlation w.r.t the real audio segments of a different class. In the case of Conditional VAE, the mean cross-correlations between the real audio segments of a given class w.r.t. the synthetic segments of all classes lie in a very narrow range. For example, the mean cross-correlation between the sampled real Bronchiectasis audio segments and that of synthetic segments lie in the range of 0.41-0.48, which implies that the synthetic samples belonging to the other classes that resemble Bronchiectasis can make it harder for our classifiers to classify the intended minority class correctly. Another example observed in Fig 28 is the high correlation of 0.65 between the synthetic Bronchiectasis samples and the real Pneumonia samples, which can cause the classifiers to misclassify real Bronchiectasis samples as Pneumonia, potentially hurting the performance of our classifiers with the Conditional VAE augmentation. A potential explanation for such observations could be that the features of the wheezes and crackles observed for a minority class such as Bronchiectasis are very similar to those observed in another minority class such as Pneumonia.

Mel Cepstral Distortion.
The Mel Cepstral Distortion (MCD) measures the difference between two sequences of mel cepstra. To account for the differences in timing, we used Dynamic Time Warping (DTW) to identify the best possible alignment between the real and synthetic audio signal. We measured the difference MCD (C ti ;Ĉ ti ) between the mel frequency cepstral coefficients (MFCCs) of the real C ti and syntheticĈ ti audio segments over T timeframes using Eq (4) ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi X To measure the MCD between the real and synthetic signals of each class, we took samples of size 50 from each class and computed the MCD between each synthetic sample and all the   synthetic samples created by Conditional VAE, the MCD is significantly higher, which can be justified because the Conditional VAE model was trained on all six minority classes simultaneously, giving it lesser time to learn the essential features of the disease classes.

Classification models
Various evaluation metrics used to determine the performance of a classification model are as follows: 1. Confusion Matrix: It provides an insightful idea of the classes that are predicted correctly and incorrectly by the model and a count of errors (False Positives and False Negatives) made. It is useful for measuring other metrics like recall, precision, specificity and accuracy. Fig 31 shows the structure of a confusion matrix and the commonly used terms with a confusion matrix.
2. Precision: Precision (Eq 5) represents the fraction of positive predictions that were actually correct. Hence, it is also called positive predictive value.
3. Recall: Recall (Eq 6) represents the fraction of positive samples that were classified correctly. It is also known as sensitivity or true positive rate.
4. F1-Score: F1 score (Eq 7) is the harmonic mean of precision and recall and is denoted by the formula 5. Specificity: Specificity (Eq 8) represents the fraction of negative samples that were classified correctly. Hence, it is also called the true negative rate.
As indicated in Figs 33 to 37, the classification models can classify only COPD correctly and overfit for the minority classes with an imbalanced training set. This issue is resolved for certain minority classes with the help of our proposed data augmentation techniques. Table 7 shows the average of the classification metrics over all classes, whereas the confusion matrix gives us the actual performance of the classifiers per respiratory class. We trained each classifier on the four training sets (imbalanced + 3 augmented) thrice for 100 epochs. We reported the aggregated metrics in Table 6 and the best confusion matrix obtained from the three trials. To understand the impact of data augmentation on the classification models, consider the LRTI class, which is almost completely misclassified by classifiers trained on the imbalanced training set. In contrast, the classifiers classify LRTI better on the augmented training sets. With the augmented training sets, the misclassification proportion for the LRTI class was reduced from 100% for all classifiers (imbalanced) to 40% for ANN (CNN-VAE augmentation), 50% for CNN (CNN-VAE augmentation), 50% for LSTM (CNN-VAE augmentation), 50% for Resnet-50 (For all three VAEs) and 30% (Conditional VAE augmentation). The improvement in the performance metrics for the LRTI class after augmentation can also be observed in Fig 32, which compares the mean F1 score of the classifiers on the test set per class. Fig 32 for the LRTI class reveals that most of the classifiers (except LSTM and Resnet-50) completely misclassified LRTI with an imbalanced training set achieving a mean F1 score of 0; which improved to a minimum F1 score of 0.47 for MLP-VAE augmentation (except ANN), 0.37 for CNN-VAE augmentation (except ANN) and 0.52 for Conditional VAE augmentation (except ANN and CNN). On observing the performance of the classifiers on the augmented training sets in Fig 32, it can be seen that the MLP-VAE and CNN-VAE augmentations were more effective than Conditional VAE. The confusion matrices in Figs 33-37 for augmented training set 3 (Conditional VAE) show poor performance for all classifiers except Resnet-50 and Efficient Net B0 On comparing the VAE augmentations, we can conclude that the MLP-VAE and CNN-VAE augmentations were more effective in enhancing the classification performance metrics than Conditional-VAE. Our evaluation of the generative models can justify this observation, which revealed issues with the Conditional VAE model, such as the resemblance of the synthetic audios of a given class with another minority class and higher MCD of the synthetic audios than the MLP-VAE and CNN-VAE.
We used Excel's "two-way ANOVA (Analysis of Variance) with Replication" for comparing the performance metrics on the imbalanced and augmented training sets. The null hypothesis H 0 assumed for the test is "No significant difference in the performance metrics was observed with the augmented training sets", and the alternate hypothesis H 1 is "A significant difference in the performance metrics was observed with the augmented training sets". Table 8 reports the statistical test results for comparing performance metrics of various classifiers on the imbalanced and augmented training sets. The p-value represents the probability of the null hypothesis H 0 being true. A significance level (SL) of 0.05 was chosen for these tests. Table 8 shows that the p-values corresponding to the above hypothesis tests for all performance metrics were less than the chosen SL. Also, the F ratio was greater than the critical F value for all performance metrics except specificity. These test results indicate that the null hypothesis H 0 can be rejected, and a significant difference in the performance metrics is observed with the augmented training sets.

Discussion
This study aimed to evaluate the impact of augmenting the imbalanced ICBHI dataset with the synthetic samples created by VAEs to achieve superior classification results for detecting multiple respiratory diseases. Our results showed that the synthetic audio segments created by MLP-VAE and CNN-VAE enhanced the classification performance metrics more than Conditional VAE. The VAE augmentation helped our classifiers achieve superior performance in detecting multiple diseases. In this section, we compared the results of our classifiers with those of existing respiratory disease classification models. We observed that very few works proposed models for the classification of multiple respiratory diseases in the literature. It is worth mentioning that a direct quantitative comparison of results is not possible due to the variations in datasets (sample size and disease classes involved), feature extraction techniques, classification algorithms and evaluation methods. Table 9 and Fig 38 collectively present the comparative summary of our work with recent works undertaken towards diseased respiratory sounds classification. Table 9 reveals that the sensitivity of our models is lower than those achieved by existing works. However, it is worth noting that the sensitivity metric in our case includes six disease classes, unlike other works where only two or four classes have been considered. [25] proposed a deep learning-based auscultation framework consisting of RNN and its variants for detecting abnormal respiratory sounds (such as crackles, wheezes or both) and pathological lung diseases (such as chronic, non-chronic and healthy). Their results showed that MFCCs computed with a window step size leading to a 50% overlap between successive time frames lead to an LSTM with the best performance (ICBHI Score: 0.90, Sensitivity: 0.98 and Specificity: 0.82). Even though our work has not focused on classifying adventitious respiratory sounds such as wheezes and crackles, we have built models that can predict a specific disease rather than only its category (chronic or non-chronic). This was made possible due to the voluminous synthetic data we generated using VAEs to help our classifiers generalize better on the minority classes. Our proposed LSTM also extracted features from MFCCs of the respiratory sounds computed on overlapping time-frames; however, the extracted features were not sufficient for classifying the respiratory diseases with the imbalanced training set. Upon the VAE augmentation, the performance of our LSTM model improved significantly for certain disease classes and the Healthy class, achieving an ICBHI score of 0.67 (Sensitivity: 0.41 and Specificity: 0.92).
[30] developed 3-class and 6-class (healthy and five diseases) classification models for diagnosing respiratory diseases in the ICBHI dataset. Similar to our work, these authors proposed a CNN-VAE for augmenting the 6-class imbalanced dataset. The augmentation led to state-ofthe-art results for their models with an ICBHI score of at least 0.99. Our work not only focused on 7-class (healthy + six diseases) classification models but also proposed a Conditional VAE, which allowed simultaneous training of the VAE for all minority classes. The disadvantage of CNN-VAE is that there is no way to control the output synthesized class, forcing us to train CNN-VAE for each respiratory class separately. Upon completing the training for a given class, the generative model can synthesize mel spectrograms for that class. As our classifiers rely on MFCCs, we needed to invert the synthetic mel spectrograms to raw audios. Training CNN-VAE for each class separately and inverting the synthetic mel spectrograms to raw audios was time-consuming. To overcome this limitation of CNN-VAE, we experimented with a Conditional VAE, which could be guided via a label to generate mel spectrograms of specific classes, allowing us to train the VAE with all classes simultaneously. As a result, the time required for synthesizing the augmented training samples for all classes was reduced; however, our results showed that the Conditional VAE augmentation was harmful to the classifiers. Hence, we conclude that the unconditional models such as MLP-VAE and CNN-VAE, though time-consuming for training and generating synthetic samples, are suitable generative models for augmenting the respiratory sounds database.
[47] divided the ICBHI dataset into two classes consisting of COPD and non-COPD. They analyzed several audio representations such as MFCCs, Mel Spectrograms, Chroma STFT, Chroma CQT and Chroma CENS to identify the most effective audio representation for feature extraction using a CNN. They discovered that MFCCs and Mel Spectrograms led to the highest ICBHI scores of 0.77 and 0.73. They further improved these scores to 0.92 and 0.82 by augmenting their training set using augmentation techniques such as loudness augmentation, shift augmentation, mask augmentation and speed augmentation. [48] proposed using entropy-based features such as Shannon entropy, logarithmic energy entropy and spectral entropy for classifying audio segments of multiple respiratory diseases. They used ensemble techniques such as bagging and boosting on classifiers such as Decision Tree and Linear Discriminant Analysis to enhance performance metrics such as sensitivity and specificity over all classes.

Conclusion
To conclude, this paper investigated the impact of augmenting the imbalanced ICBHI dataset with synthetic audio segments created by VAEs on various classifiers. The resemblance between the features of the synthetic audio segments and that of the original was measured using metrics such as Fréchet Audio Distance, Cross-Correlation and Mel Cepstral Distortion. The results showed that the unconditional generative models such as MLP-VAE and CNN-VAE were more effective in enhancing the performance metrics of the classifiers than the Conditional VAE. The PCA plots for CNN-VAE indicated that generated audio segments also exhibited better variety in terms of acoustic features when compared to those created by MLP-VAE. Lastly, the classifiers completely misclassified certain diseases such as LRTI when trained on an imbalanced training set. Upon augmenting the imbalanced training set, a significant boost in the performance metrics of the classifiers was observed for certain minority classes and marginal improvement for the other classes.